Surpassing 10Gb/s over Tailscale

Hi, it’s us again. You might remember us from when we made significant performance-related changes to wireguard-go, the userspace WireGuard® implementation that Tailscale uses. We’re releasing a set of changes that further improves client throughput on Linux. We intend to upstream these changes to WireGuard as we did with the previous set of changes, which have since landed upstream.

With this new set of changes, Tailscale joins the 10Gb/s club on bare metal Linux, and wireguard-go pushes past (for now) the in-kernel WireGuard implementation on that hardware. How did we do it? Through UDP segmentation offload and checksum optimizations. You can experience these improvements in the current unstable Tailscale client release, and also in Tailscale v1.40, available in the coming days. Continue reading to learn more, or jump down to the Results section if you just want numbers.

Background

The data plane in Tailscale is built atop wireguard-go, a userspace WireGuard implementation written in Go. wireguard-go acts as a pipeline, receiving packets from the operating system via a TUN interface. It encrypts them, assuming a valid peer exists for their destination address, and sends them to the remote peer via a UDP socket. The flow in the opposite direction is similar: packets from valid peers are decrypted after being read from the UDP socket, then written back to the kernel’s TUN interface driver.
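As a rough sketch of the transmit half of that pipeline (the peer type, its seal function, and the IPv4-only destination parse below are illustrative stand-ins, not wireguard-go's actual structures; it uses errors, io, net, and net/netip from the standard library):

// peer stands in for wireguard-go's peer state: where the peer's UDP
// traffic goes, and a function that encrypts one packet for it.
type peer struct {
	endpoint *net.UDPAddr
	seal     func(plaintext []byte) []byte
}

// transmitOnce sketches one trip through the TUN -> encrypt -> UDP pipeline.
func transmitOnce(tun io.Reader, conn *net.UDPConn, lookupPeer func(dst netip.Addr) *peer) error {
	buf := make([]byte, 65535)
	n, err := tun.Read(buf) // one packet from the OS via the TUN interface
	if err != nil {
		return err
	}
	pkt := buf[:n]
	// IPv4 only, for brevity: bytes 16..19 of the header are the destination.
	if n < 20 || pkt[0]>>4 != 4 {
		return errors.New("not an IPv4 packet")
	}
	dst := netip.AddrFrom4([4]byte(pkt[16:20]))
	p := lookupPeer(dst)
	if p == nil {
		return errors.New("no peer for destination")
	}
	_, err = conn.WriteToUDP(p.seal(pkt), p.endpoint) // encrypted packet out the UDP socket
	return err
}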

The changes we made in v1.36 modified this pipeline, enabling packet vectors to flow end-to-end, rather than single packets. The techniques applied on both ends of the pipeline reduced the number of system calls per packet, and on the TUN side they reduced the cost of moving a packet through the kernel networking stack.

This greatly improved throughput, and we have continued to build upon it with the changes we describe in this post.

Baseline

Disclaimer about benchmarks: This post contains benchmarks! These benchmarks are reproducible at the time of writing, and we provide details about the environments we ran them in. But benchmark results tend to vary across environments, and they also tend to go stale as time progresses. Your mileage may vary.

Before getting into the details of what we changed, we need to record some baselines for later comparison. These benchmarks are conducted using iperf3, as single stream TCP tests, with cubic congestion control. All hosts are running Ubuntu 22.04 with the latest available Linux kernel for that distribution.

We baselined throughput for wireguard-go@052af4a and in-kernel WireGuard. These tests were conducted between two pairs of hosts:

2 x AWS c6i.8xlarge instance types
2 x “bare metal” servers powered by i5-12400 CPUs & Mellanox MCX512A-ACAT NICs

For consistency, the c6i.8xlarge instance type is the same we used in the precursory blog post. The instances are in the same region and availability zone:

ubuntu@c6i-8xlarge-1:~$ ec2metadata | grep -E 'instance-type:|availability-zone:'
availability-zone: us-east-2b
instance-type: c6i.8xlarge
ubuntu@c6i-8xlarge-2:~$ ec2metadata | grep -E 'instance-type:|availability-zone:'
availability-zone: us-east-2b
instance-type: c6i.8xlarge
ubuntu@c6i-8xlarge-1:~$ ping 172.31.23.111 -c 5 -q
PING 172.31.23.111 (172.31.23.111) 56(84) bytes of data.

--- 172.31.23.111 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4094ms
rtt min/avg/max/mdev = 0.109/0.126/0.168/0.022 ms

We’ve added the i5-12400 systems for a bare metal comparison with interfaces operating above 10Gb/s. The i5-12400 CPU is a modern (released Q1 2022) desktop-class chip, available for $183 USD at the time of writing. The Mellanox NICs are connected at 25Gb/s via a direct attach copper (DAC) cable:

jwhited@i5-12400-1:~$ lscpu | grep Model.name && cpupower frequency-info -d && cpupower frequency-info -p
Model name: 12th Gen Intel(R) Core(TM) i5-12400
analyzing CPU 0:
  driver: intel_pstate
analyzing CPU 0:
  current policy: frequency should be within 800 MHz and 5.60 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
jwhited@i5-12400-1:~$ sudo ethtool enp1s0f0np0 | grep Speed && sudo ethtool -i enp1s0f0np0 | egrep 'driver|^version'
	Speed: 25000Mb/s
driver: mlx5_core
version: 5.15.0-69-generic
jwhited@i5-12400-2:~$ lscpu | grep Model.name && cpupower frequency-info -d && cpupower frequency-info -p
Model name: 12th Gen Intel(R) Core(TM) i5-12400
analyzing CPU 0:
  driver: intel_pstate
analyzing CPU 0:
  current policy: frequency should be within 800 MHz and 5.60 GHz.
                  The governor "performance" may decide which speed to use
                  within this range.
jwhited@i5-12400-2:~$ sudo ethtool enp1s0f0np0 | grep Speed && sudo ethtool -i enp1s0f0np0 | egrep 'driver|^version'
	Speed: 25000Mb/s
driver: mlx5_core
version: 5.15.0-69-generic
jwhited@i5-12400-1:~$ ping 10.0.0.20 -c 5 -q
PING 10.0.0.20 (10.0.0.20) 56(84) bytes of data.

--- 10.0.0.20 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4078ms
rtt min/avg/max/mdev = 0.008/0.035/0.142/0.053 ms

Now for the iperf3 baseline tests.

c6i.8xlarge over in-kernel WireGuard:

ubuntu@c6i-8xlarge-1:~$ iperf3 -i 0 -c c6i-8xlarge-2-wg -t 10 -C cubic -V
iperf 3.9
Linux c6i-8xlarge-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:56:53 GMT
Connecting to host c6i-8xlarge-2-wg, port 5201
      Cookie: 3jzl3sa34hkbpwbmg4dbfh6aovbknnw7x5hn
      TCP MSS: 1368 (default)
[  5] local 10.9.9.1 port 51194 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  3.11 GBytes  2.67 Gbits/sec   51   1.00 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.11 GBytes  2.67 Gbits/sec   51             sender
[  5]   0.00-10.05  sec  3.11 GBytes  2.66 Gbits/sec                  receiver
CPU Utilization: local/sender 5.1% (0.3%u/4.8%s), remote/receiver 11.2% (0.2%u/11.0%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic

c6i.8xlarge over wireguard-go@052af4a:

ubuntu@c6i-8xlarge-1:~$ iperf3 -i 0 -c c6i-8xlarge-2-wg -t 10 -C cubic -V
iperf 3.9
Linux c6i-8xlarge-1 5.19.0-1022-aws #23~22.04.1-Ubuntu SMP Fri Mar 17 15:38:24 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:55:42 GMT
Connecting to host c6i-8xlarge-2-wg, port 5201
      Cookie: zlcrq3xqyr6cfmrtysrm42xcg3bbjzir3qob
      TCP MSS: 1368 (default)
[  5] local 10.9.9.1 port 54410 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  6.21 GBytes  5.34 Gbits/sec    0   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.21 GBytes  5.34 Gbits/sec    0             sender
[  5]   0.00-10.04  sec  6.21 GBytes  5.31 Gbits/sec                  receiver
CPU Utilization: local/sender 8.6% (0.2%u/8.4%s), remote/receiver 11.8% (0.6%u/11.2%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic

i5-12400 over in-kernel WireGuard:

jwhited@i5-12400-1:~$ iperf3 -i 0 -c i5-12400-2-wg -t 10 -C cubic -V
iperf 3.9
Linux i5-12400-1 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:41:44 GMT
Connecting to host i5-12400-2-wg, port 5201
      Cookie: hqkn7s3scipxku5rzpcgqt4rakutkpwybtvx
      TCP MSS: 1368 (default)
[  5] local 10.9.9.1 port 48564 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  13.7 GBytes  11.8 Gbits/sec  8725    753 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  13.7 GBytes  11.8 Gbits/sec  8725             sender
[  5]   0.00-10.04  sec  13.7 GBytes  11.7 Gbits/sec                   receiver
CPU Utilization: local/sender 26.3% (0.1%u/26.2%s), remote/receiver 17.4% (0.5%u/16.9%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic

i5-12400 over wireguard-go@052af4a:

jwhited@i5-12400-1:~$ iperf3 -i 0 -c i5-12400-2-wg -t 10 -C cubic -V
iperf 3.9
Linux i5-12400-1 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64
Control connection MSS 1368
Time: Wed, 12 Apr 2023 23:39:22 GMT
Connecting to host i5-12400-2-wg, port 5201
      Cookie: ohzzlzkcvnk45ya32vm75ezir6njydqwipkl
      TCP MSS: 1368 (default)
[  5] local 10.9.9.1 port 52486 connected to 10.9.9.2 port 5201
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting 0 seconds, 10 second test, tos 0
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  9.74 GBytes  8.36 Gbits/sec  507   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.74 GBytes  8.36 Gbits/sec  507             sender
[  5]   0.00-10.05  sec  9.74 GBytes  8.32 Gbits/sec                  receiver
CPU Utilization: local/sender 11.7% (0.1%u/11.6%s), remote/receiver 6.5% (0.2%u/6.3%s)
snd_tcp_congestion cubic
rcv_tcp_congestion cubic

With the baselines captured, let’s look at some profiling data to understand where we may be bottlenecked.

Linux perf and flame graphs

The flame graphs below were rendered from perf data. They represent the amount of CPU time spent for a given function/stack. The wider the function, the more expensive it (and/or its children) are. These are interactive; you can click to zoom and hover to see percentages.

This first graph is from the iperf3 sender:

Notably, more time is being spent sending UDP packets than encrypting their payloads. Let’s take a look at the receiver:

The receiver looks fairly similar, with UDP reception being nearly equal in time spent relative to decryption.

We are using the {send,recv}mmsg() (two m’s) system calls, which help to amortize the cost of making a syscall. However, on the kernel side of the system call, we see {send,recv}mmsg() calls into {send,recv}msg() (single m). This means that we still pay the cost of traversing the kernel networking stack for every single packet, because the kernel side simply iterates through the batch.
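To make the batching concrete, here is a minimal sketch of a batched UDP read loop using golang.org/x/net/ipv4, whose ReadBatch maps to recvmmsg(2) on Linux. This is illustrative rather than wireguard-go's actual code; the port and batch size are arbitrary.

package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	c, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 51820}) // arbitrary port
	if err != nil {
		log.Fatal(err)
	}
	pc := ipv4.NewPacketConn(c)

	const batchSize = 128
	msgs := make([]ipv4.Message, batchSize)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 65535)}
	}
	for {
		// One recvmmsg(2) call can return up to batchSize datagrams, but the
		// kernel still walks its UDP stack once per packet to fill them.
		n, err := pc.ReadBatch(msgs, 0)
		if err != nil {
			log.Fatal(err)
		}
		for _, m := range msgs[:n] {
			packet := m.Buffers[0][:m.N]
			_ = packet // a real receiver would hand this to decryption
		}
	}
}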

On the TUN side of wireguard-go, we make use of TCP segmentation offload (TSO) and generic receive offload (GRO), which enable multiple TCP segments to pass through the kernel stack as a single segment:

What we need is something similar, but for UDP. Enter UDP generic segmentation offload.

UDP generic segmentation offload (GSO)

UDP GSO enables the kernel to delay segmentation of a batch of UDP datagrams in a similar fashion to the TCP variant, reducing the CPU cycles per byte cost of traversing the networking stack. Linux support was authored by Willem de Bruijn and introduced into the kernel in v4.18. UDP GSO was propelled by the adoption of QUIC in the datacenter, but its benefits are not limited to QUIC. It is best described by part of its summary commit message:

“Segmentation offload reduces cycles/byte for large packets by amortizing the cost of protocol stack traversal.

This patchset implements GSO for UDP. A process can concatenate and submit multiple datagrams to the same destination in one send call by setting socket option SOL_UDP/UDP_SEGMENT with the segment size, or passing an analogous cmsg at send time.

The stack will send the entire large (up to network layer max size) datagram through the protocol layer. At the GSO layer, it is broken up in individual segments. All receive the same network layer header and UDP src and dst port. All but the last segment have the same UDP header, but the last may differ in length and checksum.”
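To illustrate the socket-option variant described above (this is not wireguard-go's actual code; the destination address and segment size are made up), a Linux sender can opt in like this:

package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	c, err := net.DialUDP("udp4", nil, &net.UDPAddr{IP: net.IPv4(192, 0, 2, 1), Port: 51820})
	if err != nil {
		log.Fatal(err)
	}
	rc, err := c.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}

	// Opt in to UDP GSO: every send on this socket will be segmented by the
	// kernel into segSize-byte datagrams. The same value could instead be
	// passed per send call as a UDP_SEGMENT cmsg.
	const segSize = 1452 // must fit within the path MTU after headers
	var soptErr error
	if err := rc.Control(func(fd uintptr) {
		soptErr = unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, segSize)
	}); err != nil {
		log.Fatal(err)
	}
	if soptErr != nil {
		log.Fatal(soptErr)
	}

	// A single ~46 KB write traverses the protocol stack once; the GSO layer
	// later splits it into 32 individual UDP datagrams.
	payload := make([]byte, segSize*32)
	if _, err := c.Write(payload); err != nil {
		log.Fatal(err)
	}
}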

After implementing UDP GSO on the UDP socket side of wireguard-go, the transmit direction now looks like this:

But what about the receive path? It would be ideal to optimize both directions. Paolo Abeni authored UDP generic receive offload (GRO) support, and it was introduced into the Linux kernel in v5.0. With UDP GRO the receive direction now looks like this:
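On the socket side, the receive-direction opt-in can be sketched like this (again illustrative rather than wireguard-go's actual code; it assumes Go 1.21+ for binary.NativeEndian): enable UDP_GRO, then read with a control-message buffer so the kernel can report the original segment size of a coalesced datagram.

package main

import (
	"encoding/binary"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	c, err := net.ListenUDP("udp4", &net.UDPAddr{Port: 51820}) // arbitrary port
	if err != nil {
		log.Fatal(err)
	}
	rc, err := c.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	var soptErr error
	if err := rc.Control(func(fd uintptr) {
		soptErr = unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_GRO, 1)
	}); err != nil {
		log.Fatal(err)
	}
	if soptErr != nil {
		log.Fatal(soptErr)
	}

	buf := make([]byte, 1<<16) // large enough for a coalesced "super datagram"
	oob := make([]byte, 128)
	n, oobn, _, _, err := c.ReadMsgUDP(buf, oob)
	if err != nil {
		log.Fatal(err)
	}
	segSize := n // no coalescing: the read is a single datagram
	if cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn]); err == nil {
		for _, m := range cmsgs {
			if m.Header.Level == unix.SOL_UDP && m.Header.Type == unix.UDP_GRO && len(m.Data) >= 4 {
				// The kernel reports the original segment size in host byte order.
				segSize = int(binary.NativeEndian.Uint32(m.Data))
			}
		}
	}
	log.Printf("read %d bytes; original segments were %d bytes each", n, segSize)
}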

Updates to the UDP man page for these new features eventually arrived, in which an important requirement for UDP GSO is described:

Segmentation offload depends on checksum offload, as datagram checksums are computed after segmentation.

Checksum offload is widely supported across Ethernet devices today. It also reduces the cost of the kernel networking stack, as Ethernet devices tend to have specialized hardware that is very efficient at computing RFC 1071 checksums. It’s often paired with segmentation offload, and, as the man page describes, checksumming may need to be performed by whichever layer performs the segmentation.

In fact, we already have to offload checksumming inside of the TCP segmentation offload implementation in wireguard-go. The kernel hands us a “monster segment,” which we are responsible for segmenting. This includes calculating checksums for the individual segments.

TUN checksum offload

If we look back at the flame graphs, we’ll find the function responsible for computing the internet checksum as part of the existing TCP segmentation offload (tun.checksum(), inlined with tun.checksumNoFold()). It accounts for a modest percentage of perf samples (6.6% on the sender) before we make any changes. After reducing the cost of the kernel’s UDP stack, the relative cost of TUN checksum offload grows along with throughput, making it our next candidate for optimization.

The existing tun.checksumNoFold() function was this:

// TODO: Explore SIMD and/or other assembly optimizations.
func checksumNoFold(b []byte, initial uint64) uint64 {
	ac := initial
	i := 0
	n := len(b)
	for n >= 4 {
		ac += uint64(binary.BigEndian.Uint32(b[i : i+4]))
		n -= 4
		i += 4
	}
	for n >= 2 {
		ac += uint64(binary.BigEndian.Uint16(b[i : i+2]))
		n -= 2
		i += 2
	}
	if n == 1 {
		ac += uint64(b[i])
	}
	return ac
}

It accumulates at most 4 bytes per loop iteration. The rewritten version unrolls the work into progressively smaller strides, consuming 64, 32, 16, 8, 4, 2, and finally 1 byte(s) of the buffer at a time (the repeated 4-byte additions inside the 64- and 32-byte strides are omitted here for brevity):

	for len(b) >= 64 {
		ac += uint64(binary.BigEndian.Uint32(b[:4]))
		ac += uint64(binary.BigEndian.Uint32(b[4:8]))
		// ... (omitted) continues through b[60:64]
		b = b[64:]
	}
	if len(b) >= 32 {
		ac += uint64(binary.BigEndian.Uint32(b[:4]))
		ac += uint64(binary.BigEndian.Uint32(b[4:8]))
		// ... (omitted) continues through b[28:32]
		b = b[32:]
	}
	if len(b) >= 16 {
		ac += uint64(binary.BigEndian.Uint32(b[:4]))
		ac += uint64(binary.BigEndian.Uint32(b[4:8]))
		ac += uint64(binary.BigEndian.Uint32(b[8:12]))
		ac += uint64(binary.BigEndian.Uint32(b[12:16]))
		b = b[16:]
	}
	if len(b) >= 8 {
		ac += uint64(binary.BigEndian.Uint32(b[:4]))
		ac += uint64(binary.BigEndian.Uint32(b[4:8]))
		b = b[8:]
	}
	if len(b) >= 4 {
		ac += uint64(binary.BigEndian.Uint32(b))
		b = b[4:]
	}
	if len(b) >= 2 {
		ac += uint64(binary.BigEndian.Uint16(b))
		b = b[2:]
	}
	if len(b) == 1 {
		ac += uint64(b[0])
	}
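The 64-bit accumulator still has to be folded down to the 16-bit ones'-complement value that goes into a packet header, and each segment's checksum also covers an IP pseudo-header. As a rough, illustrative sketch (checksumFold and tcpChecksumIPv4 are hypothetical helper names, not wireguard-go's actual API):

// checksumFold reduces the 64-bit accumulator from checksumNoFold to the
// final 16-bit ones'-complement checksum that is written into the header.
func checksumFold(sum uint64) uint16 {
	for sum > 0xffff {
		sum = (sum >> 16) + (sum & 0xffff)
	}
	return ^uint16(sum)
}

// tcpChecksumIPv4 sketches what each segment produced by TSO needs: a
// checksum over the IPv4 pseudo-header (source, destination, protocol,
// TCP length) followed by the TCP header and payload.
func tcpChecksumIPv4(srcIP, dstIP [4]byte, tcpHdrAndPayload []byte) uint16 {
	var pseudo [12]byte
	copy(pseudo[0:4], srcIP[:])
	copy(pseudo[4:8], dstIP[:])
	pseudo[9] = 6 // protocol number for TCP
	binary.BigEndian.PutUint16(pseudo[10:12], uint16(len(tcpHdrAndPayload)))

	sum := checksumNoFold(pseudo[:], 0)
	sum = checksumNoFold(tcpHdrAndPayload, sum)
	return checksumFold(sum)
}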

